Is Your LLM Really Mastering the Concept? A Multi-Agent Benchmark

Beijing Normal University, Australian National University, Beijing 101 Education Group
✉️Correspondence to fangweizhong@bnu.edu.cn

Abstract

Concepts serve as fundamental abstractions that support human reasoning and categorization. However, it remains unclear whether large language models (LLMs) truly capture such conceptual structures or primarily rely on surface-level pattern memorization. Existing benchmarks are largely static and fact-oriented, which limits their ability to probe fine-grained semantic understanding and makes them vulnerable to data leakage and overfitting. To address this limitation, we introduce CK-Arena, a dynamic benchmark for conceptual knowledge evaluation based on a multi-agent social deduction game, namely the Undercover game. In this setting, LLM-based agents are assigned subtly different concept words and must describe, distinguish, and infer conceptual properties from others’ statements. Model performance is evaluated through both game-level outcomes and the semantic quality of generated descriptions. Furthermore, CK-Arena leverages the interaction process to automatically construct high-quality question-answering data for fine-grained diagnostic analysis. Experimental results show that conceptual understanding varies substantially across models and categories, and is not strictly aligned with overall model capability.

CK-Arena Demo: Undercover Game

Below is an interactive demonstration of the Undercover game used in CK-Arena. We first introduce the basic game rules, then show the agents' interaction in the first round of a real experiment. In this game, LLM agents are assigned either the main concept ("bee") or an undercover concept ("butterfly"). Players take turns making statements about their concept without revealing it directly. The civilians' goal is to identify and eliminate the undercover agents through voting, while the undercover agents try to blend in without being detected.

Game Flow (a minimal code sketch of this loop follows the list):
1. Role Assignment: Players are randomly assigned as civilians or undercover agents, each receiving a similar but distinct concept.
2. Concept Description: In each round, players take turns describing their concept while trying to hide their identity and infer others’.
3. LLM Evaluation: Statements are scored by LLM judges based on novelty, relevance, and reasonableness.
4. Threshold-Based Elimination: If a player’s score falls below a predefined threshold, they are automatically eliminated.
5. Voting Round: After a fixed number of rounds, all surviving players vote to eliminate one player based on the dialogue so far.
6. Win Condition Check: The game ends when all undercover agents are eliminated (civilians win), when the number of undercover agents equals the number of civilians (undercover wins), or when the maximum number of rounds is reached.
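The rules above map to a simple interaction loop. The sketch below is only an illustration of that loop, not the released CK-Arena implementation: the agent methods (describe, vote), the judge's score method, the threshold value, and the voting schedule are all hypothetical placeholders.

SCORE_THRESHOLD = 0.5   # assumed elimination threshold; the paper's value may differ
MAX_ROUNDS = 5          # assumed round limit

def play_undercover(players, judge):
    """One Undercover game between LLM agents.

    `players` is a list of dicts with keys "agent", "concept", and "role"
    ("civilian" or "undercover"); `judge` scores statements. All helper
    names here are illustrative placeholders.
    """
    alive = list(players)
    history = []  # shared dialogue visible to every agent

    for _ in range(MAX_ROUNDS):
        # Concept description: each survivor describes its concept in turn.
        for p in list(alive):
            statement = p["agent"].describe(p["concept"], history)
            score = judge.score(statement, history)  # novelty, relevance, reasonableness
            history.append((p["agent"].name, statement))
            # Threshold-based elimination on low-quality statements.
            if score < SCORE_THRESHOLD:
                alive.remove(p)

        # Voting round: survivors vote to eliminate one player
        # (here a vote follows every description round, for simplicity).
        candidates = [q["agent"].name for q in alive]
        votes = [p["agent"].vote(history, candidates) for p in alive]
        if votes:
            voted_out = max(set(votes), key=votes.count)
            alive = [p for p in alive if p["agent"].name != voted_out]

        # Win-condition check.
        undercover = [p for p in alive if p["role"] == "undercover"]
        civilians = [p for p in alive if p["role"] == "civilian"]
        if not undercover:
            return "civilians win"
        if len(undercover) >= len(civilians):
            return "undercover wins"

    return "round limit reached"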

Experimental Results

We present selected experimental results here. For the complete experimental setup and analysis, please refer to the paper.

Model Leaderboard

Leaderboard of LLMs in CK-Arena. Each player starts with an initial rating of 0. After the ratings stabilize, a player that consistently defeats 0-rated opponents converges to a rating of around 420, which serves as a reference point for strong performance. The leaderboard highlights relative differences across the 18 evaluated LLMs.
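The rating dynamic described above is Elo-style: per-game gains shrink as a player's expected win probability against its opponents approaches 1. The exact update rule and constants used in CK-Arena are not given here, so the snippet below is only a generic Elo-style sketch with conventional defaults, not the benchmark's actual rating scheme.

def expected_score(r_a: float, r_b: float, scale: float = 400.0) -> float:
    """Logistic expectation that player A beats player B (standard Elo form)."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / scale))

def update_ratings(r_a: float, r_b: float, outcome_a: float, k: float = 32.0):
    """Update both ratings after one game.

    `outcome_a` is 1.0 if A wins, 0.0 if A loses, 0.5 for a draw. The scale
    and K-factor are conventional Elo defaults, not necessarily the
    constants CK-Arena uses.
    """
    delta = k * (outcome_a - expected_score(r_a, r_b))
    return r_a + delta, r_b - delta

# A player starting at 0 gains 16 points for beating another 0-rated player,
# but the per-game gain shrinks as its rating (and expected score) rises.
print(update_ratings(0.0, 0.0, 1.0))    # (16.0, -16.0)
print(update_ratings(400.0, 0.0, 1.0))  # (~402.9, ~-2.9)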

Relevance Heatmap

Relevance scores of different LLMs across various categories. In this heatmap, darker cells indicate higher scores, giving an intuitive picture of how closely each LLM's descriptions relate to the target concepts in each category.
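A heatmap like this can be reproduced by averaging judge-assigned relevance scores per (model, category) cell and plotting the resulting matrix. The sketch below assumes a simple record format of (model, category, score) tuples; the field names and data layout are illustrative, not the released CK-Arena data schema.

import numpy as np
import matplotlib.pyplot as plt

def relevance_matrix(records, models, categories):
    """Average relevance score per (model, category) cell.

    `records` is an iterable of (model, category, score) tuples collected
    from game logs (an assumed format).
    """
    sums = np.zeros((len(models), len(categories)))
    counts = np.zeros_like(sums)
    for model, category, score in records:
        i, j = models.index(model), categories.index(category)
        sums[i, j] += score
        counts[i, j] += 1
    return np.divide(sums, counts, out=np.zeros_like(sums), where=counts > 0)

def plot_heatmap(matrix, models, categories):
    fig, ax = plt.subplots(figsize=(8, 4))
    im = ax.imshow(matrix, cmap="Blues")  # darker cell = higher mean relevance
    ax.set_xticks(range(len(categories)))
    ax.set_xticklabels(categories, rotation=45, ha="right")
    ax.set_yticks(range(len(models)))
    ax.set_yticklabels(models)
    fig.colorbar(im, ax=ax, label="mean relevance")
    fig.tight_layout()
    return fig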

Static Leaderboard

QA benchmark results. Three open-source models were selected for this evaluation, and the results reflect each model's performance on individual tasks. The consistency between the QA benchmark ranking and the dynamic game ranking also indirectly supports the reliability of the dynamic evaluation.
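The consistency between the two rankings can be quantified with a rank correlation. The snippet below is a minimal sketch using Spearman's rho; the model names and rank values are placeholders, not the reported results.

from scipy.stats import spearmanr

# Hypothetical rankings: position of each model in the QA benchmark and in
# the dynamic CK-Arena leaderboard (1 = best). Values are illustrative only.
qa_rank      = {"model_a": 1, "model_b": 2, "model_c": 3}
dynamic_rank = {"model_a": 1, "model_b": 3, "model_c": 2}

models = sorted(qa_rank)
rho, p_value = spearmanr([qa_rank[m] for m in models],
                         [dynamic_rank[m] for m in models])
print(f"Spearman rank correlation: {rho:.2f} (p={p_value:.2f})")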

Win Rate Performance

The win-rate performance of six LLMs across 11 categories. A comparative analysis reveals that each model exhibits distinct strengths and weaknesses across concept categories. These variations are likely influenced by differences in training data, architectural design, and optimization strategies specific to each model. The analysis highlights each model's focus areas and knowledge gaps, and offers insights for improving conceptual reasoning.
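Per-category win rates follow directly from the game outcomes by grouping games on model and category. The sketch below assumes one row per (model, game) with a binary win flag; the column names and rows are illustrative placeholders, not the released data.

import pandas as pd

games = pd.DataFrame([
    {"model": "model_a", "category": "Animals", "win": 1},
    {"model": "model_a", "category": "Food",    "win": 0},
    {"model": "model_b", "category": "Animals", "win": 0},
    {"model": "model_b", "category": "Food",    "win": 1},
])

win_rates = (games.groupby(["model", "category"])["win"]
                  .mean()
                  .unstack("category"))  # models x categories table of win rates
print(win_rates)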

t-SNE Visualization

t-SNE visualizations of LLM statements across concept categories. Each plot shows model outputs for (a) Animals, (b) Food, and (c) Electronics. Repetitive descriptions, reflecting shallow understanding, appear as tightly clustered points, whereas richer knowledge produces more dispersed distributions. The visualizations also indicate that different LLMs center their descriptions on different focal aspects of a concept, suggesting variation in how conceptual knowledge is represented.
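Such a plot can be produced by embedding each statement and projecting the embeddings to 2-D with t-SNE. The paper does not specify the embedding backend, so the sentence-transformers encoder below is an assumption, and the function and parameter names are illustrative.

from sentence_transformers import SentenceTransformer  # assumed embedding backend
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_statement_tsne(statements_by_model, perplexity=30, seed=0):
    """2-D t-SNE map of LLM statements for one concept category.

    `statements_by_model` maps a model name to a list of its statements.
    Tightly clustered points suggest repetitive descriptions; dispersed
    points suggest more varied conceptual coverage.
    """
    encoder = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence encoder works
    labels, texts = [], []
    for model, statements in statements_by_model.items():
        labels += [model] * len(statements)
        texts += statements

    embeddings = encoder.encode(texts)
    coords = TSNE(n_components=2, perplexity=perplexity,
                  random_state=seed).fit_transform(embeddings)

    fig, ax = plt.subplots()
    for model in statements_by_model:
        idx = [i for i, label in enumerate(labels) if label == model]
        ax.scatter(coords[idx, 0], coords[idx, 1], s=10, label=model)
    ax.legend()
    ax.set_title("t-SNE of statements (one category)")
    return fig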

Interactive Knowledge Graphs

Explore concept relationships extracted from LLM-generated descriptions. Select a category to view its interactive knowledge graph.
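A knowledge graph of this kind can be assembled from (concept, relation, attribute) triples extracted from the generated descriptions. The sketch below uses networkx; the extraction step is assumed to happen upstream, and the example triples are placeholders rather than data from the benchmark.

import networkx as nx

# Illustrative triples that an extraction step might produce from statements
# about the "bee" / "butterfly" concept pair.
relations = [
    ("bee", "produces", "honey"),
    ("bee", "is_a", "insect"),
    ("butterfly", "is_a", "insect"),
    ("butterfly", "has", "colorful wings"),
]

graph = nx.DiGraph()
for head, rel, tail in relations:
    graph.add_edge(head, tail, relation=rel)

# Concepts sharing neighbors (e.g. "insect") sit close together in the graph,
# which is what makes "bee" vs. "butterfly" a subtly different concept pair.
print(sorted(nx.common_neighbors(graph.to_undirected(), "bee", "butterfly")))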

BibTeX

If you need to cite our work:

@article{xu2025probe,
  title={Is Your LLM Really Mastering the Concept? A Multi-Agent Benchmark},
  author={Shuhang Xu and Weijian Deng and Yixuan Zhou and Fangwei Zhong},
  journal={arXiv preprint arXiv:2505.17512},
  year={2025}
}